add a fusion op: fused_residual_dropout_bias #34963
Conversation
Thanks for your contribution!
LGTM
LGTM
LGTM
Xreki left a comment
On the code implementation side, consider whether it can be further encapsulated to increase code reuse.
* @brief the fused function called by every thread
*/
template <typename T, typename MaskType, typename U, int VecSize,
bool layer_norm>
layer_norm -> ComputeLayerNorm
done
const platform::CUDADeviceContext &ctx) {
// dropout_prob == 1.0f
if (std::abs(dropout_prob - 1.0f) < 1e-5) {
PADDLE_ENFORCE_CUDA_SUCCESS(
Use memory::Copy. Also, could dst and residual possibly be the same address?
done
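For reference, a minimal sketch of the suggested replacement, assuming the device-to-device overload memory::Copy(dst_place, dst, src_place, src, bytes, stream) and that the place is recovered from ctx (the place-extraction call is my assumption, not taken from the PR):

```cpp
// Sketch only: replaces the raw cudaMemcpyAsync with memory::Copy.
// BOOST_GET_CONST(platform::CUDAPlace, ctx.GetPlace()) is assumed here.
auto gpu_place = BOOST_GET_CONST(platform::CUDAPlace, ctx.GetPlace());
memory::Copy(gpu_place, dst, gpu_place, residual, rows * cols * sizeof(T),
             ctx.stream());
```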
PADDLE_ENFORCE_CUDA_SUCCESS(
cudaMemcpyAsync(dst, residual, rows * cols * sizeof(T),
cudaMemcpyDeviceToDevice, ctx.stream()));
PADDLE_ENFORCE_CUDA_SUCCESS(cudaMemsetAsync(
Use math::SetConstant.
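For illustration, a rough sketch of that suggestion, assuming the buffer being zeroed (presumably the mask) is available as a framework::Tensor; the tensor name below is hypothetical:

```cpp
// Sketch: zero-fill via math::SetConstant instead of cudaMemsetAsync.
math::SetConstant<platform::CUDADeviceContext, MaskType> set_zero;
set_zero(ctx, &mask_tensor, static_cast<MaskType>(0));  // mask_tensor is illustrative
```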
const int VecSize = 4;
if (dbias != nullptr) {
int real_vec_size = VecSize;
if (cols % VecSize != 0) real_vec_size = 1;
Don't write it on one line.
done
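For reference, the quoted one-liner expanded into the requested multi-line form would read roughly:

```cpp
int real_vec_size = VecSize;
if (cols % VecSize != 0) {
  real_vec_size = 1;
}
```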
template <typename T>
struct TestFusedResidualDropoutBias {
uint32_t _rows;
Class member variables should be named with a trailing underscore (xxx_); for a struct the _ suffix can be omitted.
done
#include <curand_kernel.h>

#include <iostream>
#include <memory>
These two headers aren't actually used, are they?
done
@@ -0,0 +1,70 @@
/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
The file name fused_dropout.h is not appropriate.
Renamed it to fused_dropout_common.h.
const platform::CUDADeviceContext &ctx, const uint64_t n) {
const uint64_t tmp_n = n / VecSize;
int threads = std::max(
(uint64_t)32, std::min(tmp_n, (uint64_t)ctx.GetMaxThreadsPerBlock()));
Use static_cast for type conversions.
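For illustration, the quoted line with the C-style casts replaced by static_cast would read roughly:

```cpp
int threads = std::max(
    static_cast<uint64_t>(32),
    std::min(tmp_n, static_cast<uint64_t>(ctx.GetMaxThreadsPerBlock())));
```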
/**
* get 1D threads and blocks
*/
template <int VecSize = 4>
Here VecSize could be passed as a regular parameter. Also, consider using the data structure and interface already in use:
Paddle/paddle/fluid/platform/gpu_launch_config.h
Lines 36 to 44 in 5eefc8c
struct GpuLaunchConfig {
dim3 theory_thread_count = dim3(1, 1, 1);
dim3 thread_per_block = dim3(1, 1, 1);
dim3 block_per_grid = dim3(1, 1, 1);
int compute_capability = 0;
};
inline GpuLaunchConfig GetGpuLaunchConfig1D(
const platform::CUDADeviceContext& context, int64_t element_count,
done
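For illustration, a hedged sketch of how the quoted interface might be used; the kernel name is hypothetical, and any parameters of GetGpuLaunchConfig1D beyond element_count are assumed to be defaulted since the quote is cut off:

```cpp
// Sketch: derive the 1-D launch configuration from the shared helper.
auto config = platform::GetGpuLaunchConfig1D(ctx, rows * cols / vec_size);
FusedKernel<T, VecSize><<<config.block_per_grid, config.thread_per_block, 0,
                          ctx.stream()>>>(/* kernel arguments */);
```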
}

// aligned vector generates vectorized load/store on CUDA
template <typename T, int VecSize>
There is no need to define this again; you could also directly reference the implementation here:
Paddle/paddle/fluid/platform/fast_divmod.h
Lines 25 to 28 in 5eefc8c
template <typename T, int Size>
struct alignas(sizeof(T) * Size) CudaAlignedVector {
T val[Size];
};
done
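A minimal sketch of reusing that type for a vectorized load; the pointer name and index expression below are illustrative, not taken from the PR:

```cpp
using LoadT = platform::CudaAlignedVector<T, VecSize>;
// One vectorized load of VecSize contiguous elements.
LoadT src_vec = *reinterpret_cast<const LoadT *>(&src[row_id * cols + col_id]);
```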
*
*/

/********Forward**************/
The comments at L27 - L32 are not meaningful; suggest deleting them.
done
namespace platform = paddle::platform;
namespace cg = cooperative_groups;
namespace memory = paddle::memory;
These namespace aliases are unnecessary; the code can already use platform::float16 and memory::Copy directly.
done
if (bias != nullptr)
*bias_value = *reinterpret_cast<const LoadT *>(&bias[col_id]);

float4 rand = curand_uniform4(state);
If VecSize is 1, would there be a problem here, or at least poor performance?
Generating 4 random numbers here regardless of VecSize is also not very reasonable.
Changed.
}
}

/********Backward**************/
Don't use this kind of comment.
done
* 2. save 128*8 temporary sum in 8*128 shared memory
* 3. reduce the sum of 128 rows data by 8*VecSize warps
*/
template <typename T, typename MaskType, int BLOCK_SIZE_X, int BLOCK_SIZE_Y,
BLOCK_SIZE_X -> BlockSizeX
done
}
}

// save temporary sum to cache and do transpose
Is this for a block-level reduce? Could it call the implementation below:
Paddle/paddle/fluid/operators/kernel_primitives/compute_primitives.h
Lines 87 to 88 in 7743cdf
template <typename T, typename ReduceOp>
__device__ __forceinline__ T BlockXReduce(T val, ReduceOp reducer) {
It's a bit different. Here one block performs an 8-way reduce: the block size is 8*128 threads, every 128 threads reduce one column of data, and 8 sums are output.
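To make the described layout concrete, here is an illustrative CUDA sketch (not the PR's kernel) of an 8 x 128 thread block where each 128-thread column reduces one output column through shared memory; all names, the float element type, and the single-grid-row assumption are mine:

```cpp
// Launch with dim3 block(BlockSizeX, BlockSizeY) = (8, 128) and one grid row.
template <int BlockSizeX /* = 8 */, int BlockSizeY /* = 128 */>
__global__ void ColumnSumSketch(const float *in, float *out, int rows,
                                int cols) {
  __shared__ float cache[BlockSizeX][BlockSizeY];
  const int col = blockIdx.x * BlockSizeX + threadIdx.x;
  float sum = 0.0f;
  // Each thread accumulates a strided subset of the rows of its column.
  for (int row = threadIdx.y; row < rows; row += BlockSizeY) {
    if (col < cols) {
      sum += in[row * cols + col];
    }
  }
  cache[threadIdx.x][threadIdx.y] = sum;
  __syncthreads();
  // Tree-reduce the 128 partial sums of each of the 8 columns.
  for (int stride = BlockSizeY / 2; stride > 0; stride >>= 1) {
    if (threadIdx.y < stride) {
      cache[threadIdx.x][threadIdx.y] += cache[threadIdx.x][threadIdx.y + stride];
    }
    __syncthreads();
  }
  if (threadIdx.y == 0 && col < cols) {
    out[col] = cache[threadIdx.x][0];
  }
}
```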
};

TEST(FusedDropout, GPUFusedResidualDropoutBias) {
const int rows = 16;
Several different test configurations can be written into a single TEST via a for loop.
done
@zhangting2020 please help review this as well.
is_test = false;
hasbias = true;
platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance();
auto devicectx = pool.Get(place);
Naming suggestions:
hasbias -> has_bias
devicectx -> device_ctx
done
rows, cols, dropout_prob, is_upscale_in_train, src, residual, bias,
dst);
}
}
In the function call above, the if (cols % VecSize != 0) check seems to affect only the setting of VecSize. Is there a better way to handle this duplicated code?
VecSize is a template parameter; I haven't found another way.
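One possible way to avoid the duplicated call sites, sketched under the assumption that the only difference between the branches is the compile-time vector size; the helper and kernel names below are hypothetical, not from the PR:

```cpp
#include <type_traits>  // std::integral_constant

// Map the runtime divisibility check to a compile-time VecSize exactly once,
// so the launch code is written a single time.
template <typename Launcher>
void DispatchVecSize(int cols, Launcher &&launch) {
  if (cols % 4 == 0) {
    launch(std::integral_constant<int, 4>{});
  } else {
    launch(std::integral_constant<int, 1>{});
  }
}

// Usage sketch:
// DispatchVecSize(cols, [&](auto vec) {
//   constexpr int kVecSize = decltype(vec)::value;
//   FusedResidualDropoutBiasKernel<T, kVecSize>
//       <<<grid, block, 0, ctx.stream()>>>(/* ... */);
// });
```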
/**
* @brief call paddle dropout op
*/
Is this test testing the original dropout op? I see there is also a test for fuse_dropout_op, so I don't quite understand the intent of this unit test.
This header is shared by several dropout-related unit tests; it calls dropout_op as the baseline version for comparison.
Xreki left a comment
LGTM. A few code optimization suggestions; they can be addressed in a follow-up PR.
*/
inline platform::GpuLaunchConfig Get1DBlocksAnd2DGrids(
const platform::CUDADeviceContext &ctx, const uint32_t rows,
const uint32_t cols, const int VecSize) {
Variable names should use the xxx_xxx style: VecSize -> vec_size
}

__forceinline__ __device__ void RandVec(curandStatePhilox4_32_10_t *state,
float *data, const int VecSize) {
Same as above.
return config;
}

__forceinline__ __device__ void Rand1(curandStatePhilox4_32_10_t *state,
I think writing this as a template function and then specializing it would be better.
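A sketch of that suggestion (not necessarily what the PR adopted): a primary template with per-VecSize specializations, so each variant draws only as many random numbers as it consumes:

```cpp
// Primary template: declared only, so unsupported VecSize fails to link.
template <int VecSize>
__forceinline__ __device__ void RandVec(curandStatePhilox4_32_10_t *state,
                                        float *data);

template <>
__forceinline__ __device__ void RandVec<1>(curandStatePhilox4_32_10_t *state,
                                           float *data) {
  data[0] = curand_uniform(state);  // one random number for VecSize == 1
}

template <>
__forceinline__ __device__ void RandVec<4>(curandStatePhilox4_32_10_t *state,
                                           float *data) {
  float4 rand = curand_uniform4(state);  // four random numbers for VecSize == 4
  data[0] = rand.x;
  data[1] = rand.y;
  data[2] = rand.z;
  data[3] = rand.w;
}
```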
T factor = static_cast<T>(1.0f / (1.0f - dropout_prob));
if (!is_upscale_in_train) {
factor = static_cast<T>(1.0f);
}
L109 - L112 can be written as one line using the ? x : y operator.
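For illustration, the quoted lines collapsed with the conditional operator would read roughly:

```cpp
T factor = is_upscale_in_train
               ? static_cast<T>(1.0f / (1.0f - dropout_prob))
               : static_cast<T>(1.0f);
```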
factor = static_cast<T>(1.0f - dropout_prob);
if (is_upscale_in_train) {
factor = static_cast<T>(1.0f);
}
L114 - L117 can be written as one line using the ? x : y operator.
test.has_bias = has_bias[j];
test.Run();
test.CheckOut(default_diff);
if (!is_fp16) {
Is the gradient not checked for fp16?
T default_diff = static_cast<T>(1e-5);
if (is_fp16) {
default_diff = static_cast<T>(1e-2);
}
L258 - 260 can also be simplified to one line.
default_diff = static_cast<T>(1e-2);
}
for (int i = 0; i < cols_list.size(); i++) {
for (int j = 0; j < 2; j++) {
This could be written as follows:
for (auto col : {16, 17}) {
for (auto has_bias : {true, false}) {
...
}
}

TEST(FusedDropout, GPUFusedResidualDropoutBias3) {
const int rows = 16;
const int cols = 16;
TestFusedResidualDropoutBias<float> test(rows, cols, 0, 1.0, true, false);
GPUFusedResidualDropoutBias2 and GPUFusedResidualDropoutBias3 also differ only in one parameter setting; they can all be merged together.
}

// test large shape
TEST(FusedDropout, GPUFusedResidualDropoutBias6) {
Make the unit test names more specific as well, so they indicate what is actually being tested.
PR types
New features
PR changes
OPs
Describe
Fused elementwise_add, dropout and elementwise_add into one operator
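As a rough reference for what the fused op computes, a scalar sketch follows. It assumes the bias is added before dropout and reuses the upscale_in_train scaling visible in the kernel snippets above; treat the exact element order, the variable names, and uniform01/rng as my reading of the PR rather than its specification:

```cpp
// Scalar reference (illustrative): out = residual + dropout(src + bias)
for (int i = 0; i < rows; ++i) {
  for (int j = 0; j < cols; ++j) {
    float x = src[i * cols + j] + (bias != nullptr ? bias[j] : 0.0f);
    bool keep = uniform01(rng) >= dropout_prob;  // per-element dropout mask
    float scale = is_upscale_in_train ? 1.0f / (1.0f - dropout_prob) : 1.0f;
    out[i * cols + j] = residual[i * cols + j] + (keep ? x * scale : 0.0f);
  }
}
```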